The 2015 Canadian federal election was a long one (by Canadian standards anyway!). It generated a lot of press and we thought it would be a great opportunity to collect and play with that data.
Each party has a different view on what is important to move Canada forward. This has the effect of centering their official message around certain themes. The question we asked is: can we find patterns in the words used in each party’s press releases? Do some deal more with certain themes than others? Can we gain insights into their communication strategies?
This demo is based on a blog post by François Maillet from right after the campaign.
In [1]:
from pymldb import Connection
mldb = Connection()
We wrote a small scraper using the scrapy Python library. We scraped the press releases of the NDP, the Liberals, the Green Party and the Conservatives. The resulting data was saved in a CSV file with one press release per line.
In [2]:
print mldb.post('/v1/procedures', {
    'type': 'import.text',
    'params': {
        'dataFileUrl': 'http://public.mldb.ai/datasets/press_releases/data.csv.gz',
        'outputDataset': {'type': 'sparse.mutable',
                          'id': 'raw'},
        'named': "'pr' + lineNumber()",
        'ignoreBadLines': True,
        'where': "TIMESTAMP date > TIMESTAMP '2015-08-04'"
    }
})
Here is what the raw data look like.
In [3]:
mldb.query('SELECT * FROM raw LIMIT 5')
Out[3]:
Interested in knowing which parties issued the most press releases during the campaign? Easy!
In [4]:
mldb.query("""
    SELECT count(*) as count
    NAMED party
    FROM raw
    GROUP BY party
    ORDER BY count DESC
""")
Out[4]:
Before continuing, we will create a dataset of dummy articles that are in fact recurring themes we have observed in the data. We will add those to the data to see where they map on the final plot.
In [5]:
print mldb.post('/v1/procedures', {
    'type': 'transform',
    'params': {
        'inputData': """
            SELECT
                column AS full_text,
                column AS title,
                'keyword' AS party
            NAMED 'kw' + rowName()
            FROM (SELECT * FROM row_dataset({
                "Justin Trudeau": 1,
                "middle class": 1,
                "syrian refugees": 1,
                "terrorism": 1,
                "infrastructures": 1,
                "health": 1,
                "crime": 1,
                "jobs": 1,
                "education": 1,
                "oil": 1,
                "deficit": 1,
                "youth": 1,
                "holidays": 1,
                "scandal": 1,
                "tax cuts": 1,
                "first nations": 1,
                "debate": 1,
                "small businesses": 1,
                "military": 1
            }))
        """,
        'outputDataset': {'id': 'keywords', 'type': 'tabular'}
    }
})
And here is what that fancy API call built:
In [6]:
mldb.query('SELECT * FROM keywords LIMIT 5')
Out[6]:
The next step is to take the full_text column and tokenize its contents. We do this because we will need to compare press releases using their content, word by word.
The tokenize function first splits the column into (lowercased) individual words, removing words shorter than 3 characters. We then remove a list of stopwords (such as "and", "to", "the", etc.).
We do this on the merged dataset comprised of the press releases and of our keywords.
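To make the idea concrete, here is a plain-Python sketch of what this tokenization step computes. The actual processing below is done server-side by MLDB's tokenize and filter_stopwords functions; the stopword list here is a tiny illustrative subset, not the one MLDB uses.

```python
# Plain-Python sketch of the tokenization step (illustrative only).
import re
from collections import Counter

STOPWORDS = {'and', 'the', 'for', 'that', 'will', 'with', 'our'}  # tiny subset

def bag_of_words(text, min_token_length=3):
    # lowercase, split on the same separator characters as the MLDB call,
    # then drop short tokens and stopwords
    words = re.split(r"[ \-'\"?!;:/\[\]*,.]+", text.lower())
    words = [w for w in words
             if len(w) >= min_token_length and w not in STOPWORDS]
    return Counter(words)

print(sorted(bag_of_words("The middle class and the middle-class tax cuts!").items()))
# [('class', 2), ('cuts', 1), ('middle', 2), ('tax', 1)]
```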
In [7]:
print mldb.put('/v1/functions/filter_stopwords', {
    'type': 'filter_stopwords'
})
print mldb.post('/v1/procedures', {
    'type': 'transform',
    'params': {
        'inputData': """
            SELECT
                filter_stopwords({
                    words: {
                        tokenize(
                            lower(full_text),
                            {splitChars: ' -''"?!;:/[]*,.',
                             minTokenLength: 3}
                        ) as *
                    }
                })[words] as *
            FROM merge(raw, keywords)
            WHERE party = 'keyword'
               OR (full_text IS NOT NULL
                   AND title IS NOT NULL)
        """,
        'outputDataset': 'bag_of_words'
    }
})
The result is a clean list of words that will be easy to compare across press releases. Cells that contain a number represent the number of times that word appeared in the press release, while cells containing NaN indicate that the press release did not contain the given word. Because MLDB supports sparse datasets, it is more efficient to simply not record anything when a word is not in a press release, rather than write an explicit "0" in the dataset.
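To illustrate the sparsity point with a toy example (the vocabulary and counts below are made up for illustration):

```python
# Sketch: why sparse storage pays off for bags of words. With a large
# vocabulary, each press release contains only a tiny fraction of all
# words, so storing only the non-zero counts (as MLDB does) is far
# smaller than a dense row full of explicit zeros.
vocabulary = ['middle', 'class', 'refugees', 'jobs', 'oil', 'deficit']  # illustrative

dense_row = [2, 2, 0, 1, 0, 0]                      # one cell per vocabulary word
sparse_row = {'middle': 2, 'class': 2, 'jobs': 1}   # only words actually present

# Both encode the same counts:
recovered = [sparse_row.get(w, 0) for w in vocabulary]
print(recovered == dense_row)  # True
```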
In [8]:
mldb.query("SELECT * FROM bag_of_words LIMIT 5")
Out[8]:
We can easily answer some questions from that new dataset with simple queries, such as: what are the most common words across our press releases?
In [9]:
%matplotlib inline
mldb.query("""
    SELECT *
    FROM transpose((
        SELECT sum({*}) as *
        NAMED 'nb_occurrences'
        FROM bag_of_words
        WHERE rowName() NOT LIKE 'kw%' -- ignoring the keyword rows we added
    ))
    ORDER BY nb_occurrences DESC
    LIMIT 20
""").sort_values('nb_occurrences', ascending=True).plot(kind='barh', figsize=(15,6))
Out[9]:
The word2vec tool is used to embed words into a high-dimensional space. Concretely, this means that we can map words into a 300-dimensional space in which "similar" words will be close, and "dissimilar" words will be far away.
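As a toy illustration of "close" and "far" in embedding space, here is a cosine-similarity sketch with made-up 3-dimensional vectors (real word2vec vectors are 300-dimensional and learned from data):

```python
# Toy illustration of similarity in embedding space. The vectors here
# are invented for the example; only the relative geometry matters.
import numpy as np

def cosine(u, v):
    # cosine similarity: 1.0 means same direction, ~0 means unrelated
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

emb = {
    'health':   np.array([0.9, 0.1, 0.0]),
    'hospital': np.array([0.8, 0.2, 0.1]),
    'oil':      np.array([0.0, 0.2, 0.9]),
}

# related words should score higher than unrelated ones
print(cosine(emb['health'], emb['hospital']) > cosine(emb['health'], emb['oil']))  # True
```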
You can obtain the trained word2vec embedding here. Since the original data is quite big, and we really just need a small subset of it (the coordinates of the words that actually appear in our press releases), we are going to load only that portion, which we extracted beforehand.
In [10]:
print mldb.post('/v1/procedures', {
    'type': 'import.text',
    'params': {
        'dataFileUrl': 'http://public.mldb.ai/datasets/press_releases/GoogleNewsEmbedding_ElectionsDemo.csv.gz',
        'outputDataset': 'w2v',
        'select': '* EXCLUDING(word)',
        'named': 'word'
    }
})
Here is what it looks like.
In [11]:
mldb.query("SELECT * FROM w2v LIMIT 5")
Out[11]:
Word2vec gives us word embeddings, but since we are interested in press releases, we need a way to embed the documents themselves in the space. The most obvious way is to take the centroid of all the words in a document as the representation of that document in the 300-dimensional space.
We can achieve this using a pooling function.
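To see what the 'avg' aggregator computes, here is a minimal numpy sketch of centroid pooling, using made-up 3-dimensional word vectors in place of the real 300-dimensional ones:

```python
# Sketch of 'avg' pooling: a document's embedding is the mean of the
# embeddings of its words. The vectors below are invented for the example.
import numpy as np

word_vectors = {
    'middle': np.array([1.0, 0.0, 0.0]),
    'class':  np.array([0.0, 1.0, 0.0]),
    'jobs':   np.array([0.0, 0.0, 1.0]),
}

def pool_avg(words):
    # average the vectors of the words we have an embedding for
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0)

doc = pool_avg(['middle', 'class', 'jobs'])
print(doc)  # [0.33333333 0.33333333 0.33333333]
```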
In [12]:
print mldb.put('/v1/functions/pool', {
    'type': 'pooling',
    'params': {
        'aggregators': ['avg'],
        'embeddingDataset': 'w2v'
    }
})
In [13]:
print mldb.post('/v1/procedures', {
    'type': 'transform',
    'params': {
        'inputData': """
            SELECT pool({words: {*}})[embedding] AS *
            FROM bag_of_words
        """,
        'outputDataset': 'word2vec'
    }
})
In the new word2vec dataset, each row is still a press release and the columns are the word2vec coordinates.
In [14]:
mldb.query("SELECT * FROM word2vec LIMIT 5")
Out[14]:
The t-SNE algorithm is a very powerful technique that can be used to reduce the dimensionality of data to 2D or 3D, typically to visualize it. We will use it to map our press releases on a 2D scatter plot.
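One way to build intuition for the perplexity parameter used below: it can be read as a smooth effective number of neighbours, equal to 2 raised to the Shannon entropy of a point's neighbour distribution. Here is a small numpy sketch of that definition (not MLDB's implementation):

```python
# Perplexity of a neighbour distribution P: perplexity(P) = 2 ** H(P),
# where H is the Shannon entropy in bits. t-SNE tunes each point's
# neighbourhood bandwidth so that this value matches the user's setting.
import numpy as np

def perplexity(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()  # normalize to a probability distribution
    h = -np.sum(p * np.log2(p, where=p > 0, out=np.zeros_like(p)))
    return 2.0 ** h

# A point with 5 equally likely neighbours has perplexity 5
print(perplexity([0.2] * 5))  # ~5.0 (within floating-point error)
```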
In [15]:
print mldb.post('/v1/procedures', {
    'type': 'tsne.train',
    'params': {
        'trainingData': 'SELECT * FROM word2vec',
        'rowOutputDataset': 'tsne',
        'perplexity': 5,
        'modelFileUrl': 'file://pr_tsne.bin.gz'
    }
})
In [16]:
mldb.query("SELECT * FROM tsne LIMIT 5")
Out[16]:
In [17]:
pr = mldb.query("""
    SELECT raw.title as title, raw.party as party, tsne.x as x, tsne.y as y
    NAMED raw.rowName()
    FROM raw
    JOIN tsne ON tsne.rowName() = raw.rowName()
""")

kw = mldb.query("""
    SELECT keywords.party as party, keywords.title as title, tsne.x as x, tsne.y as y
    NAMED keywords.rowName()
    FROM keywords
    JOIN tsne ON tsne.rowName() = keywords.rowName()
""")
pr[:5]
Out[17]:
And now let's create our final plot! We simply use Bokeh to make a scatter plot of the press releases, where we color each point by the color of its party. We then add our keywords on top of that, using the coordinates given by t-SNE.
In [18]:
import bokeh.plotting as bp
from bokeh.models import HoverTool
In [19]:
#this line must be in its own cell
bp.output_notebook()
In [20]:
fig = bp.figure(
    plot_width=900, plot_height=700, x_axis_type=None, y_axis_type=None, min_border=1,
    title="Press Releases of Canadian Federal Parties During 2015 Elections",
    tools=[HoverTool(tooltips=[('title', '@title')])], toolbar_location=None
)

# the colored points
colormap = {"ndp": "#FF8000", "liberal": "#DF0101", "conservative": "#0000FF", "green": "#01DF01"}
for party, color in colormap.iteritems():
    fig.scatter(
        x='x', y='y', color=color, radius=1, fill_alpha=0.5, legend=party,
        source=bp.ColumnDataSource(pr[pr['party'] == party])
    )

# add some subtle white background for the text to be more visible
fig.rect(
    x='x', y='y', color='white', alpha=.6, height=5,
    width=[1 + len(w)*2.5 for w in kw['title'].values],  # heuristic
    source=bp.ColumnDataSource(kw)
)

# the keywords on top of the scatter plot
fig.text(
    x='x', y='y', text='title', text_font_size='11pt', text_color='black', text_font='courier',
    text_align='center', text_baseline='middle', source=bp.ColumnDataSource(kw)
)

bp.show(fig)
Out[20]:
This plot is interactive. You can explore it by hovering over the different press releases and the tooltip will show the title of each press release. Now have fun spotting thematic biases across parties!